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We have developed a novel structure-based evaluation for missense variants that explicitly 
models protein structure and amino acid properties to predict the likelihood that a variant 
disrupts protein function. A structural disruption score (SDS) is introduced as a measure 
to depict the likelihood that a case variant is functional. The score is constructed using 
characteristics that distinguish between causal and neutral variants within a group of 
proteins. The SDS score is correlated with standard sequence-based deleteriousness, 
but shows promise for improving discrimination between neutral and causal variants 
at less conserved sites. The prediction was performed on 3-dimentional structures of 
57 gene products whose homozygous SNPs were identified as case-exclusive variants 
in an exome sequencing study of epilepsy disorders. We contrasted the candidate 
epilepsy variants with scores for likely benign variants found in the EVS database, and 
for positive control variants in the same genes that are suspected to promote a range of 
diseases. To derive a characteristic profile of damaging SNPs, we transformed continuous 
scores into categorical variables based on the score distribution of each measurement, 
collected from all possible SNPs in this protein set, where extreme measures were 
assumed to be deleterious. A second epilepsy dataset was used to replicate the findings. 
Causal variants tend to receive higher sequence-based deleterious scores, induce larger 
physico-chemical changes between amino acid pairs, locate in protein domains, buried 
sites or on conserved protein surface clusters, and cause protein destabilization, relative 
to negative controls. These measures were agglomerated for each variant. A list of nine 
high-priority putative functional variants for epilepsy was generated. Our newly developed 
SDS protocol facilitates SNP prioritization for experimental validation. 
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INTRODUCTION 

Several prediction programs are available to evaluate missense 
variants as either deleterious (having a strong functional effect) or 
neutral (having no or only a weak functional effect) from the level 
of DNA or protein sequence conservation (Cooper and Shendure, 
2011). While existing sequence-based damaging scores agree for 
the most deleterious variants, predictions for candidate moderate 
effect variants identified from sequencing studies are not much 
better than chance. Since there is no clear way to truly evaluate the 
predictive accuracy of the scores prior to experimental assessment 
of function, there is scope for development of orthogonal meth- 
ods for variant prioritization. Our study explores the utility of 
solely using protein structure-based assessments as a complement 
to existing sequence-based scores. 

Of the commonly used automatic tools for prediction of vari- 
ant deleteriousness, PolyPhen2 (Adzhubei et al., 2010) already 
incorporates protein structure information. It uses an iterative 
greedy algorithm to select certain features from a restricted 
training set, and then takes a Bayesian approach to assign each 
variant into one of four effect categories: probably damaging, 
possibly damaging, benign, and unknown. However, it does not 
perform evaluations on the actual protein structure that each 



variant is found in. Rather, PolyPhen2 includes experimentally 
derived-structures that are available for ~10% of the training set. 
Although the implementation has high accuracy (73-92%) for the 
identification of true positives in cross-validation data, structural 
data does not directly contribute to evaluations of novel genes and 
it is not clear how efficiently the generalized structural character- 
istic rules used by the algorithm can contrast clinically-associated 
variants from neutral variants in a diverse gene set. 

In this study, we therefore introduce a new approach for assess- 
ing the deleteriousness of non-synonymous single nucleotide 
polymorphisms (nsSNPs). Our newly developed protocol uses 
additional information, that is, protein structure-based assess- 
ments applied only where the structural solution is available, 
to complement existing sequence-based scores. More specifi- 
cally, our evaluation pipeline focuses on functionality of protein 
residues derived from 3-dimensional (3D) protein structures. We 
also incorporate multiple classes of structural assessment, namely 
measures of protein stability, flexibility, protein-protein interac- 
tion potential, and small-molecular binding. As several studies 
(Capriotti and Altman, 2011; fordan et al, 2011) have proven 
that structural information increases classification accuracy of 
SNPs, we hypothesized that by incorporating results from several 
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Structure-based assessments, it may be possible to generate char- 
acteristic profiles that enhance prediction of the degree to which 
a candidate rare variant may disrupt protein function, and lead to 
disease development. 

We applied this newly developed variant assessment protocol 
to a set of 57 gene products harboring homozygous missense vari- 
ants, discovered in a recent large-scale exome sequencing study, 
that are exclusive to epilepsy patients (Heinzen et al, 2012). 
Epilepsy is a highly genetically heterogeneous disease, for which 
each likely causal variant is observed in a small fraction of indi- 
viduals, likely with variable expressivity and penetrance (Noebels, 
2003). As a result, it is difficult to ascertain which variants are 
truly responsible for the etiology of disease in individual patients. 
None of the case-exclusive variants documented by Heinzen et al. 
(2012) had a high enough prevalence to support statistical associ- 
ation with the disease, so experimental tests will be needed to filter 
putative causal variants. By contrasting the spectrum of struc- 
tural features of the case variants with positive control known 
causal variants and negative control neutral variants observed 
in healthy individuals for the same proteins, we illustrate the 
potential for structural assessment to prioritize new variants for 
functionalization. 

MATERIALS AND METHODS 

Our analysis pipeline applied sequence- and structure-based 
assessments to missense mutations and their 3D protein struc- 
tures to depict the likelihood that a mutation disrupts protein 
function. Numerous databases and prediction programs were 
used. The flow diagram of the analysis protocol is illustrated in 
Figure 1. 

GENOMIC DATASET AND CANDIDATE PROTEIN SEOUENCES 

The epilepsy-specific amino acid substitutions identified from a 
recent exome sequencing study of epilepsy disorders (Heinzen 
et al, 2012) served as our case variants for which we aimed to 
assess whether or not they are likely to impact protein function. 
In that study, exome sequencing was performed on 1 18 cases and 
242 controls. Follow-up genotyping for candidate causal vari- 
ants included approximately 90 and 65% of individuals with 
European ancestry in the case (« = 878) and control (n = 1830) 
groups, respectively (Heinzen et al., 2012). The study identified 72 
homozygous variants (68 are nsSNPs) found in 71 genes ("gene 
set 1") that were exclusive to cases. Among these, 52 nsSNPs were 
present in more than one affected individual. All genes in this 
first dataset had been previously characterized but not known 
to cause epilepsy; therefore, we added a second gene set ("gene 
set 2") to represent genes known to associate with the disorders. 
We attained the second gene list (n = 41 genes) from two pub- 
lic repositories of genetic variations: MSV3d (Luu et al., 2012) 
and SwissVar (Mottaz et al., 2010); none of the genes overlap with 
entries from the primary dataset. There are 373 missense vari- 
ants in the 41 genes that have been documented to cause epilepsy; 
therefore, we treated them as case variants for gene set 2. 

For both sets of genes, we compiled corresponding nega- 
tive neutral and positive causal variants from the EVS database 
(retrieved March 2013) (NHLBI Exome Sequencing Project, 
2013), and MSV3d (July 2012 release) (Luu et al, 2012) and 



SwissVar (accessed February 2013) (Mottaz et al., 2010), respec- 
tively. Positive controls are documented non-epileptic disease- 
causing nsSNPs found in the same genes (n = 134 nsSNPs from 
14 genes of set 1, and n = 205 nsSNPs from 41 genes of set 2). 
Likewise, negative controls are variants observed in these genes, 
but with no clinical associations (neutral nsSNPs). Any negative 
controls already identified as either case or positive SNPs were 
excluded from the list of neutral SNPs, resulting in 5281 and 1490 
putatively neutral (i.e., negative control) SNPs for sets 1 and 2, 
respectively. 

GENE AND VARIANT ANNOTATIONS 

In order to infer amino acid indices for the altered amino 
acid residues, nsSNPs were mapped to their corresponding pro- 
tein sequences and structures using transcript IDs. All protein 
sequences (major isoforms) were downloaded from the UniProt 
database (accessed February 2013) (Uniprot Consortium, 2012). 
Prior to applying our new variant analysis protocol, we performed 
literature searches on the genes and SNPs in our datasets in order 
to manually annotate their influence on the disease. In particu- 
lar, we compared the features of gene sets 1 and 2, and recorded 
relevant findings. 

First we grouped genes by their related biological path- 
ways or biological functions using a gene group profiling 
method (Reimand et al, 2011). Second, we performed literature 
searches using SNPshot — a text mining tool for PubMed abstracts 
(accessed December 2012) (Hakenberg et al., 2012). Third, we 
assumed that amino acid mutations caused by the rare case SNPs 
or the causal SNPs would locate in the vicinity of functional 
sites of protein chains. Therefore, we utilized UniProt's sequence 
feature records (accessed February 2013) (Uniprot Consortium, 

2012) to check if the mutating amino acids locate in any of 
the important sites, e.g., molecule processing sites, binding sites, 
modification sites, etc. 

Population-specific minor allele frequencies (MAFs) for all 
variants were compiled from NHLBI GO Exome Sequencing 
Project (ESP6500) (June 2012 release), available from dbNSFP 2.0 
(accessed March 2013) (Liu et al, 2011). 

PROTEIN STRUCTURE DATASET 

We used protein 3D structures to determine the structural nature 
of altered protein residues and to evaluate the effects of sin- 
gle point mutations introduced by nsSNPs on a specific protein. 
To ensure that we represent most of the proteins with high 
quality 3D structures, we employed several structural sources. 
Experimentally derived structures were retrieved from RCSB 
(retrieved April 2013) (Bernstein et al., 1977). Homology mod- 
els were compiled from SAHG (retrieved July 2013) (Motono 
et al., 2011) or automatically built using Phyre2 (accessed April 

2013) (Kelley and Sternberg, 2009). Multiple structural candi- 
dates representing an overlapping protein chain were compared 
and only one best structure was chosen to represent the best 
non-overlapping protein segment. 

Details of the two approaches for acquiring protein homol- 
ogy models are as follows. First, we searched for 3D models 
from the SAHG database (Motono et al., 2011), which contains 
a collection of encoded human protein structures, constructed 
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Datasets: 

- Exclusive Case SNPs from epilepsy study (exome sequencing) (Heinzen et. al, 2012) 

- Neutral (Negative) SNPs reside in the same genes as Cases (EVS) 

- Disease Causal (Positive) SNPs reside in the same genes as Cases (MSVSd, SwissVar) 



Deleterious scores of SNPs: 

- SIFT, PolyPhen_HDIV, PolyPhen_HVAR, LRT, MutationTasser, MutationAssessor 

Other parameters for amino acid changes: 

- Grantham Scores, glycine/proline changes, disulfide changes 



^ 



SNPs mapped to protein structure & structure-based predictions 



Protein stability 

- Residues with 
stabilization roles 
(SCide, SRide) 

- Residues induce 
large stability 
changes (PopMusIc) 



Protein flexibility 

- Dynamic sites 
(PredyFlexy) 

- Flexible sites 

(FlexPred) 




Liqand binding 
capability 

- Binding residues 
{3DLigandSlte) 

- Low sequence 
optimality residues 
(PopMusic) 



Protein-protein 
interactions 
- Patch residues on 
protein surfaces 

(PatchFinder) 



Enrichment analysis of causal vs. neutral SNPs 
& construction of Structural Disruption Score (SDS) for case SNPs 



FIGURE 1 I Flow diagram of the analysis pipeline. Tlie analysis employed 
sequence-based deleterious prediction scores, parameters which reflect the 
nature of amino acid changes, and 3D structure-based evaluations. Structural 
analyses were performed by characterizing functionality of mutated protein 
residues caused by negative and positive SNPs (indicated by green and red 



stick representations, respectively). All analysis results were collectively used 
to evaluate enriched features found predominantly in causal SNPs. We then 
examined these predictors with regard to the case variants. The number of 
deleterious structure predictions per substitution represents a "structural 
disruption score" (SDS), and was used to rank candidate epilepsy variants. 



by Modeller software (Sali and BlundeU, 1993). We downloaded 
only structures having >15% sequence identity to the template. 
The retrieved proteins exhibit either ligand bound (holo) and/or 
unbound (apo) forms. Second, we built protein models by multi- 
ple template methodology using the automated Phyre2 homology 
modeling server (Kelley and Sternberg, 2009). Structure tem- 
plates were selected by default and models were built from vari- 
able numbers of high confidence templates. This multi-template 
approach ensures that the model covers most of the protein chain. 
Large proteins (>1200 amino acids) were truncated into smaller 
domain(s) using domain boundary information from Interpro 
(QuevUlon et al., 2005). A model representing each shortened 



sequence was built independently using either the single- or 
multi-template method; there was no attempt to join multiple 
models into a single model for a protein. For models created with 
the Phyre2 server, we retained the best homology model based on 
the empirical criteria that >50% of the residues were modeled 
at >90% confidence. 

After the initial homology model selection, the models were 
further subjected to energy minimization with explicit solvent 
using the YASARA force field (Krieger et al, 2009) to resolve any 
steric conflicts found within the structures. Next, we validated 
the homology models using two independent scores: QMEAN6 
(Benkert et al, 2009) and ModFOLD4 (McGuffin et al, 2013). 
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Both scores show good ability to distinguish between good and 
bad models in the recent Critical Assessment of protein Structure 
Prediction (GASP) experiments (Kryshtafovych et al, 2014). To 
facilitate the structural validation step, we selected structures 
that pass the QMEAN6 threshold for subsequent ModFOLD4 
evaluations. 

In many cases, we initially selected more than one validated 
structure to represent an identical protein domain. To retain only 
one best representative structure for a protein segment, we used 
Chimera (Pettersen et al., 2004) to visualize aU structure can- 
didates and determined the structural similarity among them 
using two parameters: root-mean-square deviation (RMSD) of 
Cc atoms, and quality score (Q-score) that normalizes an RMSD 
by the alignment length. All measurements were performed with 
Chimera's MatchMaker tool (Pettersen et al., 2004). When sev- 
eral overlapping structures agreed with each other, we selected the 
one with the best ModFOLD4 score. When the structures were in 
disagreement, we discarded them all together. Our retrieval and 
validation pipeline for protein 3D structures yielded 114 non- 
overlapping structures representing 57 gene products from gene 
set 1, and 51 non-overlapping structures representing 36 proteins 
from gene set 2. 

Table 1 summarizes the number of missense variants from 
our genomic dataset in three categories (case, negative, and pos- 
itive controls), with respect to the presence/absence of their 
corresponding 3D structures. 

INFERRING VARIANT DELETERIOUSNESS FROM SEQUENCE-BASED 
PREDICTORS 

We obtained sequence-based predictions for each amino acid 
variant from dbNSFP 2.0 (accessed March 2013) (Liu et al, 201 1). 
The program provides precomputed deleteriousness scores for six 
established deleterious prediction algorithms: SIFT (Kumar et al., 
2009), PolyPhen2_HumDiv and PolyPhen2_HumVar (Adzhubei 
et al., 2010), LRT (Chun and Fay, 2009), MutationTaster 
(Schwarz et al., 2010), and MutationAssessor (Reva et al, 2011). 
Three evolutionary conservation-based scores were also included: 
GERP-|~|- (Davydov et al, 2010), phyloP (Pollard et al, 2010), 
and SiPhy (Lindblad-Toh et al., 201 1). For simplicity, we assigned 
the level of deleteriousness and conservation to each mutation 
based on how many predictors reported the mutation to be either 
"deleterious" (maximum score of 6) or "conserved" (maximum 
score of 3). 

ADDITIONAL PARAMETERS FOR SEQUENCE-BASED ANALYSIS 

In addition to the SNP-based prediction parameters that are 
derived from multiple sequence alignments, some useful infor- 
mation can be analyzed from a single protein sequence alone. For 
example, amino acids with similar physicochemical properties 
may substitute for one another while maintaining the function- 
ality of the protein. Three indicators may be used to highlight 
the most severe changes of amino acid pairs. First, Grantham 
scores (Grantham, 1974) reflect the degree of physico-chemical 
difference between pairs of amino acids. Second, changes involv- 
ing any glycine or proline residues are likely to affect protein 
function since these two residues have special roles with regard 
to protein structure: proline has an exceptional conformational 
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rigidity compared to other amino acids whUe glycine is much 
more conformationally flexible (Gunasekaran et al, 1998). Third, 
gain or loss of disulfide bonds occurs when variants induce 
changes in cysteine residues. Disulfide bond formation between 
non-adjacent cysteines can facilitate protein folding; hence, they 
are important for maintaining the structural integrity of the 
protein (Darby and Creighton, 1995). In this context, we used 
DiANNA webserver (Ferre and Clote, 2006) to predict the disul- 
fide connectivity patterns in the wild type protein, and then deter- 
mined if the amino acid mutation affects the bonding of cysteine 
pairs. 

INFERRING VARIANT DELETERIOUSNESS FROM STRUCTURE-BASED 
PREDICTORS 

A number of currently available protein structural analysis tools 
have the potential to be applied to structure-based variant assess- 
ment protocols (Verma et al., 2012). To assess the functionality 
of mutated protein residues, we concentrated on four features of 
structural analysis: protein stability, protein flexibility, protein- 
ligand binding potential, and protein-protein interaction poten- 
tial. Many mutations disrupt these elements, and as a result, 
contribute to disease etiology. 

Protein stability 

For assessment of protein stability, we aimed to first identify 
amino acids with specialized roles in promoting protein stabil- 
ity, and second to determine which mutations cause a significant 
change in protein stability. For the first objective, we used SCide 
webserver (Dosztanyi et al., 1997, 2003) and SRide program 
(Magyar et al., 2005) to identify amino acids with essential sta- 
bility functions. Long-range stabilization center (SC) residues are 
pairs of amino acids having close atomic contact (sum of van 
der Waals radii <1 A), but locate at least ten amino acids apart 
on the primary sequence (Dosztanyi et al., 1997, 2003). A subset 
of SC residues may make distinct contributions to protein sta- 
bility because they are also evolutionary conserved and located 
in the core region of the protein, and/or have many interact- 
ing partners. SC residues with these two extra properties are 
referred as stabilizing residues (SRs); they are also expected to 
make key contributions to protein stability (Magyar et al., 2005). 
For the second objective, we aimed to determine if a partic- 
ular mutation affects protein stability by means of inducing a 
large magnitude of free energy change (A AG). For this purpose, 
we selected PoPMusic 2.1 (Dehouck et al, 2011) as our A AG 
predictor. 

Amino acid changes that increase protein stability (A AG < 0) 
and those associated with the destabilizing mutation (A AG > 
0) are noted. Due to large differences in performance of stabil- 
ity change calculations (Khan and Vihinen, 2010), the proper 
margin for severe stability change can be ambiguous. However, 
it is known that the sensitivity in predicting stabilizing muta- 
tions is much less than for destabilizing ones (Worth et al, 
2011), and the correlation between predicted stability change 
(AAGp) and measured values (AAGm) of our selected pro- 
gram is ~lkcal/mol (Dehouck et al., 2011). Therefore, in our 
study, we followed the suggestions made by Dehouck et al. 
(2011). The stability changes are categorized into four levels: 



no change if A AG is between ±0.5 kcal/mol, mildly stabiliz- 
ing if A AG is between —0.5 and —2 kcal/mol, mildly desta- 
bilizing if A AG is between 0.5 and 4 kcal/mol, and strongly 
destabilizing if A AG is >4 kcal/mol. 

Protein flexibility 

Protein flexibility is an important protein feature because highly 
dynamic sites are often involved in special functions, such as 
binding residues that can undergo subtle motion rearrangements 
when a small molecule is bound. Flexible amino acid residues 
permit large protein movements during protein folding and con- 
formational switches (Teilum et al., 2009). For evaluating the 
levels of residue dynamics within a protein, we employed the 
predicted B-factors (relative vibrational motion) and root-mean- 
square fluctuations (RMSFs) obtained from a prediction pro- 
gram PredyFlexy (de Brevern et al., 2012) to classify amino acid 
residues into rigid, intermediate, or flexible sites. For predicting 
protein movements of higher amplitudes, such as in conforma- 
tional switches, we used the program FlexPred (Kuznetsov, 2008; 
Kuznetsov and McDuffie, 2008) to determine which amino acid 
residues are located at conformationally flexible sites, indicated by 
a probability value P(Flexible). 

Protein-ligand binding potential 

For a SNP that causes an amino acid change in the vicinity of a 
catalytic site or a ligand binding site, it is possible to determine 
whether the mutation is indeed affecting the catalytic activity or 
the ligand binding affinity of the protein. In silica predictions 
are possible, but they require extensive computational resources. 
We utilized two alternative approaches to predict the ligand 
binding sites or catalytic sites from protein 3D structures, and 
assessed whether or not the altered protein residues locate in or 
near the predictions. The first approach began with the use of 
3DLigandSite (Wass et al, 2010) to search for ligands present in 
homologous structures. Then, a cluster of amino acids located 
within a default distance setting of 0.8 A of the selected ligand 
was predicted as a pocket site, and amino acid residues that make 
up that pocket site were specified as ligand binding residues. In 
the second method, catalytic sites were predicted by scanning 
for protein residues that are not well optimized. This assump- 
tion is based on the finding that catalytic sites are generally 
designed for function rather than stability (Dehouck et al., 201 1). 
Low optimality residues are those whose several possible muta- 
tions would improve protein stability. The program PoPMusic 
(Dehouck et al., 2011) is fast enough that it can calculate stabil- 
ity changes (A AGs) of all possible mutations at a given position 
in the protein sequence, and was used to identify non-optimized 
amino acid residues based on the summation of all stabilizing 
A AGs. This parameter designates the degree of non-optimality 
(r) for each amino acid residue along the protein chain. 

Protein-protein interaction potential 

Disease-causing mutations that do not occur in binding sites or 
buried sites are predominantly found on protein interfaces (David 
et al., 2012); therefore, we assessed which of the mutating protein 
residues may be involved in this type of inter-molecular interac- 
tion. We used PatchFinder program (Nimrod et al, 2005, 2008), 
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which evaluates evolutionary conservation scores in conjunc- 
tion with solvent accessibility of protein residues, to determine 
the most significant cluster of conserved residues on the sur- 
face of a protein. This patch indicates possible functional sites of 
protein-protein interactions. 

ADDITIONAL PARAMETERS FOR STRUCTURE-BASED ANALYSIS 

Other information retrieved from the structure-based data 
includes the type of protein secondary structure that each vari- 
ant interrupts, and the relative solvent accessibility (RSA) of 
the altered protein residue. We obtained these predictions from 
PoPMusic (Dehouck et al., 2011). Due to the small sample size, 
we modified the eight reported types of protein secondary struc- 
ture (Kabsch and Sander, 1983) into five groups: (G/H/I), E, 
(T/B), S, and C. The 3-, 4-, and 5-turn helices (groups G, H, and 
I, respectively) were grouped jointly as helices. Extended strand 
in parallel and/or anti-paraUel P-sheet remains as an individual 
group (group E). Groups T or B correspond to helices or sheets 
whose hydrogen bonding patterns are too short to form proper 
secondary structures. Lastly, groups S and C denote bend and coil 
annotations, respectively. 

The RSA for residue X is expressed as a percentage of that 
observed for an Alanine -X- Alanine tripeptide. This conformation 
would expose the central X residue in the tripeptide as much as 
would normally be possible in a protein (Dehouck et al., 2011). 
We considered protein residues whose RSA < 20% as buried sites. 
Otherwise, they are expected to be on the protein surface. 

STATISTICAL COMPARISON OF POSITIVE AND NEGATIVE SNPs 

We tested which predictions and measures can statistically dis- 
tinguish between negative and positive control SNPs from each 
gene group. Particularly, we assessed which characteristics are 
most likely to be enriched in positive controls, and therefore 
imply disruption of functionality. After defining thresholds of 
likely deleterious function, we classified the predicted values for 
negative and positive controls into each different category. The 
categories for structural indicators, the numerical cutoff values, 
and the numbers of mutations with extreme values from the two 
datasets are summarized in Table 2. 

For continuous parameters, we compared the difference 
between the means of the positive and negative controls 
using two-tailed unpaired f-tests (Table 3). The distributions 
of scores within each SNP group were also illustrated by den- 
sity plots (Figures 2, 3). For non-numerical characteristics, we 
used Fisher's exact test to determine whether the proportions of 
negative and positive control SNPs for each of the features are 
significantly different (Table 4). For continuous measures, sim- 
ilar analyses were performed after first transforming the scores 
into discrete categories based on prespecified thresholds that are 
most likely to discriminate normal and aberrant residues. Once a 
series of predictions and measures was generated for all possible 
variants in a gene set, we converted continuous parameters into 
categorical classes, utilizing both literature-based and empirical 
cutoff values to represent the extremes (Table 2). 

For protein stability and sequence optimality measures, we 
followed the suggested thresholds from the PoPMusic program 
(Dehouck et al, 201 1). Stability changes were classified into four 



groups (mildly stabilizing, strongly stabilizing, mildly destabiliz- 
ing, and strongly destabilizing) using the aforementioned cutoffs 
for A AG. Regarding sequence optimality, it has been shown that 
as the threshold for F becomes more stringent, the proportion 
of catalytic sites to other sites increases (Dehouck et al, 2011). 
For our analysis, we selected a cutoff for which residues hav- 
ing r < — 5 kcal/mol are more likely to locate in ligand-binding 
domains. 

For the remaining continuous measures, i.e., B-factor, RMSF, 
and P( Flexibility), we used empirical criteria to define the 
extremes. Extreme values for data set 1 and 2 were derived inde- 
pendently, but with an identical approach. First, we compared 
the score distributions of each measurement, collected from all 
possible SNPs in the protein set. Then, we selected the thresh- 
olds for each parameter so that we captured a handful of extreme 
variants. When applicable, we made sure that our thresholds do 
not induce large numbers of misclassifications. In summary, our 
empirically-defined thresholds are generally set at the top and 
bottom 2.5 percentiles. B-factor and RMSF predictions were clas- 
sified into either highly rigid (extreme small negative values) 
or highly dynamic (extreme large positive values). Similarly, we 
denoted residues as conformational^ rigid if the P( Flexibility) is 
exceptionally low, or conformationally flexible if the probability 
is notably high. 

ASSIGNING A STRUCTURAL DISRUPTION SCORE TO CANDIDATE 
EPILEPSY VARIANTS 

After testing for statistically significant differences between neg- 
ative and positive SNPs, we summarized the list of deleterious 
structure predictions and examined these predictors with regard 
to the case variants. We counted the number of deleterious struc- 
ture predictions per substitution and represented this number as 
a "structural disruption score" (SDS). The scores were ranked 
and candidate epilepsy variants with a score of > 4 out of 7 are 
suggested to be "putative structural disrupted variants." Further 
partitioning of this list based on the gene's tolerance of poly- 
morphism, RVIS (Petrovski et al, 2013), yields two subgroups: 
variants of high tolerance genes (genes that have more variants 
than expected), and variants of low tolerance genes (genes that 
have less variants than expected). These two groups of variants are 
also discussed in detail with respect to their disease implications. 

To examine the contribution of each selected parameter, 
especially the sequence-based deleterious score, toward our 
SDS, we compared the values of SDS with the Condel com- 
posite score (Gonzalez-Perez and Lopez-Bigas, 2011), derived 
from three of the deleteriousness measures (SIFT, PolyPhen, 
and MutationAssessor). The evaluation was performed with a 
step-wise procedure. First, we tested for a correlation between 
the Condel score and SDS — including all parameters from the 
sequence-based and structure-based predictions (total n parame- 
ters). Then, we re-evaluated the correlation using n-1 parameters, 
by excluding one of the SDS components at a time. 

RESULTS 

CANDIDATE GENE AND VARIANT ANNOTATIONS 

Despite the fact that Heinzen et al. (2012) performed path- 
way analysis on 1183 genes harboring either homo- and/or 
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Table 2 | Categories for structural indicators, cutoff values for continuous numerical parameters, and number of SNPs with extreme measures. 



Indicators 


Cutoff 


# Case (%) [n = 30) 


#Neg (%) (n= 1674) 


# Pos (%) (n = 100) 


Stability change 


Stsbilizing'. AAG between —2 and —0.5 kcal/mol 
Strong stabilizing: AAG < —2 kcal/mol 
Destabilizing: AAG between 0.5 and 4 kcal/mol 
Strong destabilizing: AAG > 4 kcal/mol 


V 
0 

20 (65%) 
0 


1 Q M 0/ \ 
1 y [l /of 

0 

830 (50%) 
1 (0%) 


U 
0 

66 (66%) 
0 


Dynamic sites 


Highly rigid: B-factornorm < -0.537 (@2.5%) 
Highly dynamics: B-factornorm > 1-17 (@97.5%) 


0 

1 (3%) 


38 (2%) 
44 (3%) 


3 (3%) 
1 (1 %) 


Dynamic sites 


Highly rigid: RMSFnorm < -0.607 (@0.5%) 
Highly dynamics: RMSFnorm > 1-195 (@99.5%) 


0 

1 (3%) 


48 (3%) 
48 (3%) 


2 (2%) 

3 (3%) 


Flexible sites 


Conformationally rigid: P(Flexible) < 0.158 (@2.5%) 
Conformationally flexible: P(Flexible) > 0.860 (@975%) 


1 (3%) 
1 (3%) 


34 (2%) 
52 (3%) 


2 (2%) 
0 


Sequence optimality 


Highly non-optimal: r < —5 kcal/mol 


1 (3%) 


49 (3%) 


5 (5%) 



All cutoff values were defined exclusively for gene set 1. The counts and percentages of variants with extreme values in each of the three variant classes are 
included. 



Table 3 | T-test statistics for gene sets 1 and 2. 



Parameters Prob > |t|, (t ratio), df 



t 



Set 1 (57 genes) Set 2 (36 genes) 



Sequence-based deleterious scores SIFT 

PolyPhen2_HDIV 
PolyPhen2_HVAR 
LRT 

MutationTaster 
M utationAssesssor 



<0.0001* (6.60) 

<0.0001* (8.15) 

<0.0001* (10.22) 

<0.0001* (4.19) 

<0.0001 * (9.39) 

<0.0001* (15.30) 



df 1704 <0.0001* 

df 1772 <0.0001* 

df 1772 <0.0001* 

df 1734 0.0066* 

df 1674 <0.0001* 

df 1772 <0.0001* 



(4.61) df598 

(738) df 657 

(8.70) df 657 

(2.73) df 654 

(5.57) df 631 

(6.06) df 652 



Sequence conservation scores GERP <0.0001* (5.30) df 1772 <0.0001* (3.92) df 657 

phyloP <0.0001* (4.81) df 1772 0.0010* (3.29) df 657 

SiPhy <0.0001* (5.89) df 1771 <0.0001* (4.95) df 657 



Structure-based scores 



AAG 


<0.0001* 


(4.81) 


df 1772 


<0.0001* 


(3.83) 


df 657 


B-factor 


0.0190* 


(-2.35) 


df 1756 


0.2450 


(1.16) 


df 657 


RMSF 


0.2304 


(-1.20) 


df 1756 


0.2264 


(1.21) 


df 657 


P(Flexible) 


0.1185 


(-1.56) 


df 1772 


0.7594 


(-0.31) 


df 657 


r 


0.9090 


(-0.11) 


df 1772 


0.2308 


(-1.20) 


df 657 


RSA 


0.0035* 


(-2.93) 


df 1772 


0.0658 


(-1.84) 


df 657 



All tests were performed on a subset of SNPs whose protein structures pass quality validations. Significant statistics indicate different means between negative 
and positive SNPs. 

tStatistic parameters include the two-tailed p-value, value of the t-statistics (t ratio), and the degree of freedom (df). Significant p-values (a = 0.05) are designated 
with The number of "df" equals to n-2 samples used in the analysis. 



heterozygous nsSNPs, with a genotype exclusive to the case group, 
no significantly over-represented biological pathways were found 
(Heinzen et al., 2012). Using an alternative gene group profiling 
method (Reimand et al., 2011), we also did not observe a statis- 
tical abundance for any biological terms derived from gene set 1 . 
However, this method did reveal that ~40% of genes in set 2 have 
roles in transmission of nerve impulses, ion channel complexes, 
or ion gated channel activity. A text mining method (Hakenberg 
et al, 2012) discovered only 1 gene from set 1 that may be 



linked to epilepsy. This gene codes for a ubiquitin-like modi- 
fier activating enzyme 2 (UBA2), a drug metabolizing enzyme 
that plays roles in GABAergic and cholinergic neuronal develop- 
ment (Kitamura et al., 2009). Specifically, mutations in ubiquitin 
protein ligase along with disruptions in the important neuronal 
GABA receptor genes are suggested to induce seizure (Olsen, 
201 1). By contrast, all genes in set 2 are suspected to be involved 
in a wide range of epilepsy disorders (Mottaz et al, 2010; Luu 
et al, 2012). Also note that the proportion of variants in cases 
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FIGURE 2 I Density plots of six deleterious scores for Case, Neutral, 
and Causal SNPs. By most of tlie standard deleteriousness scores, the 
distributions of cases in gene set 1 (Panel A) are closer to the neutral 
than the causal variants, and the neutral and causal variants are 
significantly different. The "known epilepsy" dataset (gene set 2, Panel 
B) demonstrated similar results. In this gene set, variants documented to 
cause epilepsy were regarded as "cases," while variants associated with 



other disease types were considered as "causal SNPs (positive control)." 
Although three prediction programs (SIFX Polyphen2_HDIV, and 
Polyphen2_HVAR) suggested case and causal SNPs share similar 
distributions of deleterious scores, the remaining three programs illustrate 
their prediction algorithms do not favor the two types of causal variants 
equally. More importantly, case SNPs (epilepsy-causing SNPs) resemble 
neutral SNPs more than the causal ones. 



relative to controls is much lower for genes in set 1 than in set 2 
(Table 1). 

Annotation of altered amino acid residues indicated some con- 
sistent patterns between case and causal variants. When we per- 
formed sequence feature searches (Uniprot Consortium, 2012) to 
compute the number of variants localizing in structurally or func- 
tionally important sites of a protein chain, we observed that more 
than half of the positive SNPs in both gene sets 1 and 2 were 
predicted to locate in transmembrane, topological domain, or 
repeat regions. Similar patterns were found for the case-exclusive 
epilepsy variants. 



STATISTICALLY SIGNIFICANT DIFFERENCES BETWEEN POSITIVE AND 
NEGATIVE SNPs 

Table 3 documents that all deleteriousness scores, all conser- 
vation scores, and some of structure-based scores have signif- 
icantly different means between negative and positive SNPs. 
The single parameters that best differentiate the groups are 
the MutationAssesssor prediction for gene set 1 and the 
PolyPhen2_HVAR prediction for gene set 2, but more notable is 
the highly significant differentiation for all of the sequence-based 
scores. Although the t-ratios are consistently lower in set 2 than 
set 1, the smaller sample of genes precludes inference that there is 
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FIGURE 3 I Density plots for relative solvent accessibility and free 
energy change for Case, Neutral, and Causal SNPs. Tlie two 
structure-based scores demonstrate that Case and Causal SNPs sliare similar 



characteristics. Results were obtained from two independent sets of genes 
(Panels A,B). Note the shift of the blue curves (cases) toward the causal 
SNPs (red) and away from the neutral ones (green). 



Table 4 | Fisher's exact test statistics for gene sets 1 and 2. 



Enrichment types 


Significant features 


Fisher's exact test (one-tailed) 






Set 1 (57 genes)^ 


Set 2 (36 genes)* 


Enriched in causal SNPs 


Deleterious count > 4 


<o.ooor 


< 0.0001 * 




Conservation count = 3 


Enriched in neutral SNPs 


< 0.0001* 




Induce large amino acid change (Grantham score > 100) 


<0.0001* 


< 0.0001* 




Induce disulfide change 


NS (p= 0.6081) 


< 0.0001* 




Induce glycine/proline change 


0.0003* 


< 0.0001* 




Locates in buried site (RSA < 20%) 


0.0109* 


< 0.0001* 




Locate in conformationally rigid site (P{Flexible) < 0.74) 


NS (p = 0.0558) 


0.0060* 




Locate on protein patch 


<0.0001* 


< 0.0001* 




Locate in protein domain 


0.0204* 


< 0.0001* 




Strongly reduce protein stability (AAG > 4kcal/mol) 


NS (p= 1) 


0.0400* 




Reduce protein stability (AAG > 0.5 kcal/mol) 


0.0009* 


< 0.0001* 


Enriched in neutral SNPs 


Conservation count = 3 


<0.0001* 


Enriched in causal SNPs 




Locate in highly dynamics site (B-factornorm > 97.5%) 


NS (p= 0.2671) 


0.0224* 




Locate in highly flexible site (P(Flexible) > 975%) 


0.0468* 


NS (p = 0.1432) 



High sequence conservation (conservation count = 3) is a feature that enriched in causal SNPs of gene set 2, but is more lil<ely to be found in neutral SNPs of gene 
set I 

^Data for gene set ? includes 100 causal SNPs and 1674 neutral SNPs. 

*Data for gene set 2 includes 289 causal SNPs (184 epilepsy case variants plus 105 non-epilepsy positive control variants), and 554 neutral SNPs. 
Significant p-values (a = 0.05) are designated with Non-significant test statistics are labeled with NS, followed by the correspondent p-value. 



a difference between the two sets. Notably, three of the structure- 
based measures are also at least nominally significantly different 
between positive and negative control variants in set 1, and trend 
in the same direction in set 2: AAG protein stability, B-factor 
protein flexibility, and RSA. 

We converted several of the continuous structural measures to 
categorical "normal" vs. "extreme" values and compared profiles 



of disease causal variants with those of neutral variations in 
gene sets 1 and 2. Table 4 reports the list of significantly distinct 
sequence/structural features for each variant category based on 
Fisher's exact test. Seven significant characteristics of causal vari- 
ants found in the 57 genes in set 1 are: having a high deleterious 
count (>4 out of 6 scores), introducing an amino acid change 
with large physico-chemical dissimilarity, inducing glycine or 
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proline change, being situated in a protein domain, buried site or 
on a conserved protein surface, and causing at least mild protein 
destabilization. By contrast, negative control SNPs of this gene 
set were found to be enriched in conserved variants (conserva- 
tion count = 3 out of 3) and generally locate in conformationally 
flexible sites. 

These findings were validated with a parallel analysis of gene 
set 2, although in this case we gained statistical power by combin- 
ing the set of epilepsy case variants with the non-epUepsy positive 
control SNPs (combined disease-SNPs n = 289). Each of the sig- 
nificant features detected in gene set 1 replicates in set 2, and 
additional evidences that disruption of disulfide bonds and loca- 
tion in conformationally rigid sites differentiate neutral and dis- 
ease variants were obtained. These findings emphasize that causal 
variants are likely to affect protein functional sites, a conclusion 
that can only be obtained from structure-based analysis. 

STRUCTURAL FEATURES PREDICT DELETERIOUSNESS OF CASE SNPs 

Next, we asked how the distribution of scores for the putative 
epilepsy-case variants compares with the negative and positive 
controls. By most of the standard deleteriousness scores, the dis- 
tributions of cases are closer to the negative than the positive 
controls in both datasets (Figure 2). We conclude that there is lit- 
tle evidence from standard measures for enriched deleteriousness 
in the case variants from the epilepsy study. Similar observations 
were found for Grantham score, protein flexibility parameters, 
and a few other structural measures (data not shown). 

However, we determined that the solvent accessibility mea- 
sure (RSA) and a protein stability measure (A AG) assign case 
variants to be more comparable to positive than negative con- 
trols (Figure 3A). Likewise, the same two structural parame- 
ters place "known epilepsy" variants in gene set 2 (Figure 3B) 
closer to the distribution observed in other disease-causing muta- 
tions. This analysis emphasizes the potential for structure-based 
deleteriousness measures to generate predictions that are more 
discriminating than those derived from measures of sequence 
conservation. 

To obtain a list of high-priority functional nsSNPs for epilepsy, 
we applied the deleterious structure predictions enriched in pos- 
itive SNPs (Table 4) to all candidate epilepsy variants and iden- 
tified the ones with high SDSs (score >4 out of 7). A list of 
14 high-likelihood structure-disrupted variants from 30 missense 
mutations was generated. To account for differences in the bur- 
den of mutations among genes, we used the Residual Variation 
Intolerance Score (RVIS) (Petrovski et al., 2013) to identify and 
compare the levels of mutational intolerance of each gene in our 
two datasets. The parameter determines the deviation of observed 
vs. expected numbers of common variants in a gene. Petrovski 
et al. (2013) found that genes which carry many common muta- 
tions (large positive RVISs) are less likely to influence disease 
development. Comparison of average RVIS between genes in sets 
1 and 2 indicated that the two groups do not have the same toler- 
ance to variations (p-value 0.0011, two-tailed f-test). The average 
RVIS for gene set 1 is 0.39 (range 0.11-0.67, 95% CI) whereas the 
value for gene set 2 is -0.40 (range -0.77 to -0.03, 95% CI). As 
expected, lower RVISs were observed for the documented disease 
causal genes (gene set 2). The finding is consistent with low RVISs 



in many OMIM genes (Petrovski et al., 2013). Nonetheless, the 
positive average RVIS for gene set 1 is not surprising; among the 
57 genes in set 1, 38 genes (67%) are classified as being high 
tolerance. 

Sub-classification of our 14 high SDS structure-disrupted vari- 
ants yields 9 and 5 genes that are highly acceptable or tolerant of 
mutations, respectively (Table 5). Although variants in genes with 
low tolerance (negative RVISs) are more likely to be deleterious, 
our SDS focuses at the variant level, and the structural analy- 
sis potentially provides novel intuition that is not apparent from 
any sequence- or gene-based analysis. Therefore, the indication 
of many "high tolerance genes" in our dataset does not preclude 
potential functional effects of specific variants. For this reason, we 
performed in-depth literature searches on all 9 and 5 variants of 
the 2 subgroups, and provide our inference of the likelihood that a 
particular SNP may contribute to the epilepsy disorders (Table 6). 
Interestingly, we found that half (4 out of 9) of the SNPs in high 
tolerance genes have potential to promote epilepsy. The propor- 
tion is comparable (2 out of 5) for variants located in the low 
tolerance genes. Of the 9 variants in high tolerance genes, one has 
a structural feature that is compatible with those of neutral SNPs, 
i.e., the variant alters a highly flexible protein site; therefore, we 
disregarded it as a putative functional variant. Similar considera- 
tion of the low tolerance genes suggests that just one, PPP1R27, 
is likely to harbor a mutation that promotes epilepsy. This leads 
to prioritization of 9 "high-priority putative functional variants 
for epilepsy." The locations of each of these variants with respect 
to their protein 3D structures are shown in Figure 4, and each is 
discussed below (variant numbering follows Table 5). 

INDIVIDUAL ASSESSMENT OF PUTATIVE STRUCTURALLY 
DELETERIOUS VARIANTS 
Cys1359Arg in ABCA6 

Cysl359 mutation is located in a buried site of ABCA6, a pro- 
tein that plays roles in macrophage lipid homeostasis (Figure 4A). 
Despite the high SDS of this variant (score 5/7), it has not been 
associated with diseases. The functionality of cysteine residue 
depends largely on the protein structure and its cellular loca- 
tion. For this instance of Cys to Arg change, it does not alter the 
pattern of disulfide bonds, partly due to the rarity of this bond- 
ing type within membrane proteins (Betts and Russell, 2003). 
However, the C1359R mutation is still considered as a crucial 
change (Grantham score 180, AAG 1.64kcal/mol), especially 
when the mutation occurs in the middle of protein domain. 

Arg277Gl¥ in ABHD14A 

Arg277 is located on an exposed site of ABHD14A (Figure 4B). 
Another member of this protein family, ABHD5, is thought 
to be responsible for a rare genetic disorder called Chanarin- 
Dorfman syndrome. This is the only variant among the 14 struc- 
tural disrupted case SNPs that was not detected as "deleterious" 
by any sequence-based algorithms (deleterious count 0/6), but 
our SDS suggests it has potential impact on the protein (SDS 
4/7). The wild type Arg227 residue has low sequence optimality 
(r — 0.58kcal/mol), a likely indicator that it has close proximity 
to a catalytic site. Moreover, the Arg to Gly change is consid- 
ered as an unfavorable substitution, especially when it occurs in a 
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ABCA6 (Cysl359Arg) ABHD14A (Arg277Gly) AL0X12 (Arg404Gln) 




FIGURE 4 I 3D structures for the nine high-priority variants for epilepsy. 

The nine high-priority variants for epilepsy include eight structural disrupted 
variants (A-H) which locate in genes with high tolerance of polymorphisms 
(Table 5) and one structural disrupted variant (M) which resides in a low 
tolerance gene (Table 5). In all figures, mutated protein residues caused by 
the SNPs are indicated with blue surface representations along with the 
amino acid name (wild type) and residue index. Yellow surface 
representations refer to predicted binding sites or catalytic sites, except in 



(M) in which it represents a protein residue with close proximity to the 
altered amino acid. Orange representations in (C) indicate the predicted 
best conserved residue cluster on the protein surface. The substitution of 
Ser150Cys in (E) adds one favorable hydrophobic residue (green surface 
representation) to the core of the a/p horseshoe. The magenta sticks in (H) 
represent the partner residue, Glu117, which forms a salt bridge with wild 
type Argl37. This salt bridge is lost in the presence of mutant Cys at 
position 137. 



Structured protein, as the lack of a side chain in Gly may diminish 
proper protein folding or intermolecular interactions. 

Arg404Gln in AL0X12 

Arg404 is found in a buried site of ALOX12 (Figure 4C). Arg404 
is in close proximity with the catalytic site of this protein and is 
also predicted to be part of the protein patch for intermolecu- 
lar interactions. The Arg404Gln substitution within this protein 
is also predicted to cause a slight protein destabilization (A AG 
0.90kcal/mol), despite the acceptable Grantham substitution 
score (<100). We suspect the variant may play roles in epilepsy 
implication, since a link between its substrate (arachidonic acid) 
and seizure susceptibility had been proposed (Cole-Edwards and 
Bazan, 2005). 



lle463Thr in DDX52 

Ile463 is located in a buried site of DDX52 (Figure 4D). A 
study of seizure susceptibility in Drosophila discovered that a 
gain-of-function mutation in the maleless helicase gene can sup- 
press seizure susceptibility in bang-sensitive Drosophila mutants 
(Kuebler et al, 2001). The Ile463Thr mutation in this pro- 
tein affects the structural integrity of the protein since it is 
detected as a SC, consistent with its destabilization effect (A AG 
1.37kcal/mol). 

SerlBOCys in EYPC 

The SerlSOCys variant is located in the core of the leucine- 
rich repeat (LRR) structural motifs of EYPC (Figure 4E). Several 
disulfide bonds are formed between cysteine clusters that flank 
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the LRRs, providing additional structural support, but no changes 
of disulfide bonds or hydrogen bonds appear in the mutated 
protein. SerlSOCys has minimal impact on protein stability 
(A AG = — 0.41 kcal/mol), and while the substitution induces 
a large physico-chemical change (Grantham score > 100), the 
substitution is considered neutral if located in a/p protein (Xu 
et al., 2010). Indeed, we observed SerlSOCys added one favorable 
hydrophobic residue to the core of the a/p horseshoe. Despite 
its high SDS (4/7), we consider it unlikely that this variant 
contributes to any disorders. 

AspSOBGly in HELB 

Asp506 is situated in a buried site of HELB (Figure 4F) and sev- 
eral of its features are predicted to interfere with ligand binding. 
First, wild type Asp506 has an exceptionally low sequence opti- 
mality value (F — 4. 14 kcal/mol). Second, wild type Asp506 is 
predicted to be a highly flexible site by three parameters, although 
the values are not extreme. Third, the Asp506Gly substitution is 
predicted to destabilize the protein (A AG 1.40 kcal/mol). These 
characteristics coincide with a recent mutagenesis experiment 
that proves Asp506 is part of a binding motif, and the mutation 
of D506A induces loss of substrate binding when associated with 
E499A and D510A (Guler et al, 2012). Study of another human 
DNA helicase, Twinkle, demonstrated that two missense muta- 
tions were detected in patients with a wide range of psychiatric 
symptoms, including severe epileptic encephalopathy, possibly 
due to inefficient recovery from molecular stress (Lonnqvist et al., 
2009). HELB Asp506 is thus particularly interesting for further 
assessment of a role in epilepsy. 

Leu43Val in IAH1 

Leu43 is located at a buried site in lAHl, although it lies far from 
the substrate binding site (Figure 4G). The protein is of interest 
given that many antiepUeptic drugs are potent enzyme inducers 
and inhibitors of the cytochrome P450 system, which affects lipid 
and glucose metabolisms, as well as evidence that increased lipase 
level is one of the side effects of anti-psychotic and anti-epileptic 
drugs (Voudris et al., 2004, 2006). Substitution of Leu to Val is 
quite favorable in all protein folding types (Xu et al., 2010), but 
Leu43Val in this protein is suspected to reduce protein stability 
(A AG 2.06 kcal/mol). A plausible explanation may be that this 
protein is highly structured, comprising only a few loop residues. 
The turn regions may play an essential role in bringing together 
and enabling or allowing interactions between regular secondary 
structure elements. 

ArgWCys in NMUR1 

Argl37 is situated in the central cavity of NMURl (Figure 4H). 
The presence of the a mutant Cys at position 137 diminishes the 
stabilizing salt bridge between wild type Argl37 and Glul 17, and 
results in a decrease in protein stability (A AG 1.83 kcal/mol). The 
gene has no known associations with any diseases. 

lle112MetinPPP1R27 

Ilell2 is found at a buried site of PPP1R27 (Figure 4M). A 
study of a similar protein, PPP1R3C, identified one missense 
mutation that may lead to a mild phenotype in Lafora disease — 
a teenage onset epilepsy disorder (Guerrero et al., 2011). In 



addition, protein phosphatase 1 (PPl) was identified as a member 
of long term potentiation (PTP) pathway in epUeptogenesis and 
epilepsy (KEGG: map04720) (Kanehisa and Goto, 2000; Lukasiuk 
and Pitkanen, 2012). Other genes in this pathway mostly reg- 
ulate neurotransmission and ion channel receptors. The He to 
Met substitution is quite favorable both in general (Grantham 
score < 100) and in distinct types of protein folding (Xu et al., 
2010). However, the longer aliphatic side chain of Met creates 
steric clashes with Alal04 of an adjacent helical region. This sin- 
gle point mutation is predicted to destabilize the protein (AAG 
1.41 kcal/mol). 

STRUCTURAL DISRUPTION SCORE CORRELATES WITH 
SEQUENCE-BASED DELETERIOUS SCORE 

Finally, we tested whether the SDS correlates with measures not 
used to construct the score itself We performed an additional 
sequence-based deleterious prediction of case SNPs using Condel: 
a weighted score that integrates the output of five tools (Gonzalez- 
Perez and Lopez-Bigas, 2011), three of which were used in our 
analysis. There is a significant positive correlation between our 
SDS and the Condel score (p-value 0.0342, n = 30). The trend 
persists after removing the sequence-based deleterious scores 
from SDSs of 16 variants (the variants have deleterious count > 4 
out of 6), although the significance in correlation is reduced to a 
marginal level (p-value 0.0717). In addition, we performed a sim- 
ilar analysis by sequentially removing one SDS component from 
the total score, and found the significantly positive correlation 
between SDS and Condel is maintained (Table?). The weak- 
est structure-based parameter among all of the SDS components 
is the classification of buried vs. exposed site using RSA. After 
removing this indicator from 15 variants, we did not detect any 
correlation between the two measures {p-value 0.13). Note that 
this analysis is complicated by the small sample sizes. Nonetheless, 
the finding supports our expectation that although the deleteri- 
ous count is one of the major components for constructing our 
SDS, the remaining non-conventional deleteriousness parame- 
ters also have substantial impact on the overall missense variant 
evaluation. 

DISCUSSION 

CURRENT PERSPECTIVES IN PRIORITIZATION OF EPILEPSY VARIANTS 

The evaluation and prioritization of candidate case variants 
is particularly difficult when the disorder involves a dissimi- 
lar set of genes, as is the case for epilepsy disorders, which are 
now known to involve diverse molecular pathways (Noebels, 
2003; Garofalo et al., 2012). The classic epilepsy genes (ion 
channel genes/regulatory genes, neurotransmitter genes/receptor 
genes/regulatory genes, genes that disrupt cortical circuits, and 
genes that lower the convulsion stimuli) rarely present in genomic 
data from sequencing studies. This may be because the classi- 
cal epilepsy mutations are Mendelian, whereas exome sequenc- 
ing likely targets more polygenic cases, noting that only 1% of 
epilepsy disorders are inherited in a Mendelian manner (Cavalleri 
andDelanty, 2012). 

Ferraro et al. (2012) highlighted how daunting the task is 
for epilepsy when they exemplified some factors in the design 
of cohort studies that influences the discovery of true positive 
epileptic variants: the selection of cases (presence/absence of the 
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Table 7 | Step-wise analysis for correlation of SDS and Condel score. 



SDS parameters 


r2 of 


P-value of 


# of variants 




linear fit 


correlation 


affected by the 
revised SDS^ 


All SDS 


0.1504 


0.0342** 


none 


SDS-high deleterious count 


U. 1 1 1 z 


U.U/ 1 / 


1 D 


SDS-large amino acid change 


0.1508 


0.0340** 


5 


SDS-induce gly/Pro change 


0.1308 


0.0496** 


7 


SDS-locate in buried site 


0.0782 


NS (p= 0.1344) 


15 


SDS-locate on protein patch 


0.1629 


0.0270** 


3 


SDS-locate in protein domain 


0.2093 


0.0110** 


25 


SDS-reduce protein stability 


0.1243 


0.0560* 


20 



Initial SDS includes 7 parameters, described in Table 4; the maximum SDS for 
eacfi variant is 7. The revised SDS calculates the score by exclude one SDS 
component at a time; the maximum revised score for a variant equals 6. 
'^The full dataset has 30 missense variants. All data points were used to test for 
a correlation between SDS and Condel score. When an SDS component was 
removed during the step-wise analysis, the SDSs for some numbers of variants 
were affected, i.e., the excluded parameter was applicable to the variant. For 
such cases, the correlation analysis was performed with all of the 30 data points, 
minus the number of exclusions indicated in the last column. Significant p-values 
are designated with """ and "' " for a = 0.05 and a = 0.10, respectively. Non- 
significant test statistics are labeled with NS, followed by the correspondent 
p-value. 

cause of symptoms), the seizure types (determine the amount 
of genetic influences), the patient profiles (age at onset, gen- 
der, characters of seizure incidences, etc.), and the assumption 
of genetic patterns (common variant effects, rare variant effects, 
or a combination of both) (Ferraro et al., 2012). Some authors 
are starting to incorporate disease information as prior knowl- 
edge in the probabihstic evaluation of candidate causal SNPs 
(Yandell et al, 2011; Sifrim et al, 2013; Robinson et al, 2014), 
but the scarcity of knowledge related to epilepsy genes limits this 
approach. Indeed, we found that the genes in our dataset are 
somewhat poorly understood and their disease contributions are 
largely unknown. 

In addition to SNPs, structural variations have been shown 
to associate with epilepsy genetics. lia et al. (2011) used step- 
wise enrichment analysis of protein-protein interactions to derive 
a molecular network of 20 high priority candidate genes linked 
to copy number variation (Jia et al, 2011). Interestingly, the 
genes do not overlay with the 68 homozygous nsSNP-containing 
genes in our dataset [but they do overlap with 4/1604 genes 
that harbor heterozygous variants: (Heinzen et al., 2012)]. An 
independent comparison with nine genes that harbor de novo 
mutations (identified from trios — unaffected parents and their 
affected child) also found no overlap (Epi4k Consortium and 
Epilepsy Phenomegenome Project, 2013). 

Consequently, we cannot be sure that any of the variants dis- 
cussed in this article are truly causal for epilepsy. However, the 
variant prioritization scheme does suggest a reduced number 
of candidates which, on the basis of careful curation of pro- 
tein structure, might be taken forward for targeted experimental 
manipulation and assessment of biological function in cell lines 
or model organisms. 



KEY FINDINGS 

We have developed a structure-based variant analysis protocol 
that evaluates the effects of missense mutations with respect to 
their predicted effects on protein features, such as solvent accessi- 
bility, stability, and flexibility. Replicated trends for putative case 
SNPs to have aberrant structural features that more closely match 
those of established disease mutations in the same proteins, than 
to those of neutral polymorphisms, establish the potential util- 
ity of this approach as an orthogonal protocol to sequence-based 
assessment of deleteriousness. 

Starting with 71 genes harboring putative case SNPs from an 
exome sequencing study of epilepsy disorders (Heinzen et al, 
2012), we were able to perform the assessments on 57 gene prod- 
ucts. Presumably only a fraction of these are actually causal, so 
our expectation was simply that the distribution of risk scores 
may be shifted from neutral toward disease-associated. Several 
features were observed to classify SNPs into two groups: puta- 
tive functional variants, or presumably neutral variants, and a 
composite risk score based on summation of these features high- 
lighted nine putative functional variants, from thirty exclusive 
missense mutations whose protein 3D structures are available. 
Although none of these has been previously linked to epilepsy 
disorders, detailed case-by-case analysis strongly suggests that 
several should be prioritized for further functional evaluation. 

Although our structure-based analysis only captured a fraction 
of variant residues due to the limited availability of 3D structures 
(Table 1), we show that 44, 32, and 75%, of case, negative, and 
positive variants from the first gene set are amenable to structure- 
based predictions. The equivalent percentages for set 2 genes are 
49, 37, and 51%, respectively. More importantly, we were able to 
represent 84% of the proteins in our first set, and 86% in the 
second set, despite the difficulty in generating high quality struc- 
tures. Our preliminary analyses suggest that it is not necessary 
to have structural data for all variants in order to construct the 
SDS. Since the sequence-based predictions of variants with struc- 
ture data are similar to those obtained for all SNPs in the dataset 
(data not shown), we are confident that the conclusions from our 
structure-based variant assessment protocol can be extended to 
the complete SNP pools in each of the two gene sets. 

With our combined sequence- and structure-based analysis 
pipeline, we discovered that some features are predominantly 
found among negative or positive SNPs. Structure-based param- 
eters contribute as much as 50% of the features that differentiate 
the two types of variants. There are several key observations from 
our feature enrichment analysis. First, we noticed seven common 
characteristics that are predominantly found in the positive con- 
trol SNPs, regardless of the set of genes. SNPs with strong effects 
are those that: have deleterious count > 4, have Grantham score > 
100, induce glycine or proline changes, locate in protein domains, 
in buried sites, or on conserved protein patches, and destabi- 
lize the protein. Second, variants with neutral effects (negative 
SNPs) have a few strong enriched features. In gene set 1, we found 
negative SNPs are more likely to be in conformationally flexible 
sites. A similar feature was detected in gene set 2, in which non- 
damaging SNPs are mostly highly dynamic sites. An additional 
distinctive characteristic of the negative SNPs in gene set 1 is that 
they tend to affect highly conserved residues. This finding appears 
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to be counter intuitive; we suspected that this unique observation 
seems to be an exception for this particular gene group. 

STUDY LIMITATIONS 

A primary limitation of our approach is the requirement for 
homology models that support computational prediction of 
structural characteristics. Specifically, only 18/68 proteins had at 
least partial experimental structures, so homologous templates 
were used in most cases. These were not available for just 9% 
of the proteins (« = 6), but the retained models did not cover 
the disrupted site for another 56% (36/68 variant sites), and 24 
of the potentially disrupted proteins are larger than 1000 amino 
acids, which also presents additional challenges for building mod- 
els and satisfying quality settings. Nevertheless, by restricting the 
modeling to domains, we were able to model 84% of the 68 
candidate genes (covering 44, 32, and 75%, of case, negative, 
and positive variants). This is a clear improvement on the auto- 
mated pipelines used in training algorithms such as PolyPhen2. 
We also ensured that the quality of the models was validated 
wherever possible, which also introduces an intensive manual 
curation requirement into the analytical pipeline, requiring some 
knowledge with methods that most genomicists are not familiar 
with. 

An analytical limitation is that the size of the dataset is rela- 
tively small, since only one variant per gene was studied, and just 
68 proteins were available to begin with. Since these are struc- 
turally diverse, it is likely that different aspects of protein structure 
are affected and the probability of enrichment for any one struc- 
tural feature is correspondingly reduced. While approximately 
three quarters of amino acid changes leading to Mendelian dis- 
eases consistently induce protein destabilization (Yue et al., 2014), 
the structural consequences of missense variants in complex dis- 
eases such as the epilepsy disorders are likely to be of a more 
diverse nature. 

STUDY INNOVATIONS 

The SDS offers a novel strategy for genomic profiling of variants 
that have uncertain but likely weakly deleterious function. It com- 
bines similar instances of variants with respect to their predicted 
impact on aspects of protein structure, allowing joint assessment 
of the impact of the variants as a class on biological function. 
Instead of evaluating each variant one by one, SDS provides a 
ranking that might be used to guide downstream experimental 
and/or clinical evaluations. 

Other studies using structural biology approaches to exam- 
ine the variant effects have considered a large number of variants 
per gene, facilitating direct contrasts of variant characteristics and 
predictions for individual genes (Dybowski et al, 2011) or genes 
with similar structures/functions (lordan et al., 2011; Izarzugaza 
et al., 2013). The genomic data that we started with is rela- 
tively small by comparison. However, we provide an alternative 
structure-based approach that can accommodate the small num- 
ber of variants expected from exome sequencing samples as well 
as the large diversity of gene functions they will typically generate. 

Most importantly, our SDS score implementation assures that 
case variants share similar characteristics as those observed in 
causal variants, but not neutral variants known to reside in the 



same proteins. We have validated the findings in a replication 
dataset, and show how the approach can be used as a unique solu- 
tion to prioritizing case variants in unrelated genes. Though not 
providing a guarantee of disrupted function, the score should be 
considered as a complementary approach to existing sequence- 
based deleteriousness prediction. 

CONCLUSION 

Using the list of enriched features, we concluded that this novel 
structure-based assessment protocol for missense variant dele- 
teriousness has a potential to determine high-priority candidate 
variants suitable for experimental validations. The analysis may 
prove to be useful, particularly when traditional sequence-based 
predictions are inconclusive. An important question is whether 
the same structural attributes differentiate neutral and functional 
variants for different categories of diseases. 

Because our study employed large numbers of external 
resources (variant predictions, gene information, 3D structure 
modeling and quality controls, and sequence- and structure- 
based predictions), the analysis pipeline presented here is not 
readily automated. Aspects of it are in theory readily generaliz- 
able to all classes of proteins, and once all the above steps have 
been accomplished, the variant deleteriousness structure-based 
predictions could be effectively populated into a database. After 
that labor-intensive step is completed, the SDS for any variants in 
a dataset may be computed and retrieved virtually by combining 
the predictions for the genes specific for the study. 
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